An algorithm to compute the power of Monte Carlo tests with guaranteed precision
This article presents an algorithm that generates a conservative confidence
interval of a specified length and coverage probability for the power of a
Monte Carlo test (such as a bootstrap or permutation test). It is the first
method that achieves this aim for almost any Monte Carlo test. Previous
research has focused on obtaining as accurate a result as possible for a fixed
computational effort, without providing a guaranteed precision in the above
sense. The algorithm we propose does not have a fixed effort and runs until a
confidence interval with a user-specified length and coverage probability can
be constructed. We show that the expected effort required by the algorithm is
finite in most cases of practical interest, including situations where the
distribution of the p-value is absolutely continuous or discrete with finite
support. The algorithm is implemented in the R-package simctest, available on
CRAN.
Comment: Published at http://dx.doi.org/10.1214/12-AOS1076 in the Annals of
Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical
Statistics (http://www.imstat.org).
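The stopping rule described in this abstract can be sketched in a few lines. This is a simplified illustration, not the simctest algorithm itself: it assumes each replication yields an exact p-value (whereas the paper also controls the inner Monte Carlo error of the test), and it tightens a conservative Clopper-Pearson interval until the requested length is reached. All function names and the Beta(0.2, 1) toy alternative are hypothetical.

```python
import numpy as np
from scipy.stats import beta

def clopper_pearson(k, n, conf=0.95):
    """Conservative (Clopper-Pearson) confidence interval for a binomial proportion."""
    a = (1 - conf) / 2
    lo = beta.ppf(a, k, n - k + 1) if k > 0 else 0.0
    hi = beta.ppf(1 - a, k + 1, n - k) if k < n else 1.0
    return lo, hi

def power_until_precise(simulate_pvalue, alpha=0.05, width=0.1, conf=0.95,
                        rng=None, max_iter=100_000):
    """Keep simulating test replications until the CI for the power
    (rejection probability) is narrower than `width`. The effort is random."""
    rng = np.random.default_rng(rng)
    rejections, n = 0, 0
    while n < max_iter:
        n += 1
        rejections += simulate_pvalue(rng) <= alpha
        lo, hi = clopper_pearson(rejections, n, conf)
        if n >= 10 and hi - lo <= width:
            return rejections / n, (lo, hi), n
    return rejections / n, clopper_pearson(rejections, n, conf), n

# Toy stand-in for a Monte Carlo test: under the alternative, the p-value
# is Beta(0.2, 1), so the power at alpha = 0.05 is 0.05**0.2, roughly 0.55.
est, ci, n = power_until_precise(lambda rng: rng.beta(0.2, 1.0), width=0.1)
```

Note that naively re-checking a fixed-confidence interval after every replication and stopping when it is short enough does not by itself guarantee the stated coverage; handling that sequential aspect rigorously is part of what the paper contributes.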
Consistency of adjacency spectral embedding for the mixed membership stochastic blockmodel
The mixed membership stochastic blockmodel is a statistical model for a
graph, which extends the stochastic blockmodel by allowing every node to
randomly choose a different community each time a decision of whether to form
an edge is made. Whereas spectral analysis for the stochastic blockmodel is
increasingly well established, theory for the mixed membership case is
considerably less developed. Here we show that adjacency spectral embedding
into $\mathbb{R}^k$, followed by fitting the minimum volume enclosing convex
$k$-polytope to the principal components, leads to a consistent estimate
of a $k$-community mixed membership stochastic blockmodel. The key is to
identify a direct correspondence between the mixed membership stochastic
blockmodel and the random dot product graph, which greatly facilitates
theoretical analysis. Specifically, a $2\to\infty$ norm bound and central
limit theorem for the random dot product graph are exploited to respectively
show consistency and partially correct the bias of the procedure.
Comment: 12 pages, 6 figures
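The first step of the procedure, adjacency spectral embedding, is easy to sketch with numpy. The block matrix B, the Dirichlet memberships, and all dimensions below are hypothetical illustration choices; fitting the minimum volume enclosing polytope (the second step) is a harder optimisation problem and is not shown.

```python
import numpy as np

def adjacency_spectral_embedding(A, k):
    """Embed nodes into R^k using the top-k (by magnitude) eigenpairs of A,
    scaling each eigenvector by the square root of |eigenvalue|."""
    vals, vecs = np.linalg.eigh(A)                 # A assumed symmetric
    order = np.argsort(np.abs(vals))[::-1][:k]
    return vecs[:, order] * np.sqrt(np.abs(vals[order]))

# Toy 2-community mixed membership graph: memberships drawn from a Dirichlet,
# edge probabilities formed through a (hypothetical) block matrix B.
rng = np.random.default_rng(0)
n, k = 300, 2
B = np.array([[0.6, 0.1],
              [0.1, 0.4]])
Pi = rng.dirichlet(np.ones(k), size=n)             # mixed memberships, rows sum to 1
P = Pi @ B @ Pi.T                                  # edge probability matrix
A = rng.random((n, n)) < P
A = np.triu(A, 1)
A = (A + A.T).astype(float)                        # symmetric, hollow adjacency
X = adjacency_spectral_embedding(A, k)             # points near a k-vertex polytope
```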
Posterior predictive p-values and the convex order
Posterior predictive p-values are a common approach to Bayesian
model-checking. This article analyses their frequency behaviour, that is, their
distribution when the parameters and the data are drawn from the prior and the
model respectively. We show that the family of possible distributions is
exactly described as the distributions that are less variable than uniform on
[0,1], in the convex order. In general, p-values with such a property are not
conservative, and we illustrate how the theoretical worst-case error rate for
false rejection can occur in practice. We describe how to correct the p-values
to recover conservatism in several common scenarios, for example, when
interpreting a single p-value or when combining multiple p-values into an
overall score of significance. We also handle the case where the p-value is
estimated from posterior samples obtained from techniques such as Markov
chain Monte Carlo or sequential Monte Carlo. Our results place posterior
predictive p-values in a
much clearer theoretical framework, allowing them to be used with more
assurance.
Comment: 14 pages, 3 figures
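A posterior predictive p-value of the kind analysed here can be computed by simulation: draw parameters from the posterior, simulate replicate data, and compare a discrepancy on the replicates with the observed one. The conjugate normal model and the choice of the sample maximum as discrepancy below are illustrative assumptions, not taken from the article.

```python
import numpy as np

def posterior_predictive_pvalue(y, rng, n_draws=4000):
    """Posterior predictive p-value for a N(mu, 1) model with a N(0, 10^2)
    prior on mu, using T(y) = max(y) as the discrepancy (illustrative choice)."""
    n = len(y)
    # Conjugate posterior for mu: N(post_mean, post_var)
    post_var = 1.0 / (1.0 / 100.0 + n)
    post_mean = post_var * y.sum()
    exceed = 0
    for _ in range(n_draws):
        mu = rng.normal(post_mean, np.sqrt(post_var))  # posterior draw
        y_rep = rng.normal(mu, 1.0, size=n)            # replicate data set
        exceed += y_rep.max() >= y.max()
    return exceed / n_draws

rng = np.random.default_rng(1)
y = rng.normal(0.0, 1.0, size=50)   # data drawn from the model itself
p = posterior_predictive_pvalue(y, rng)
```

Under the frequency analysis in the abstract, repeating this with fresh data drawn from the prior and model yields p-values that are less variable than uniform on [0, 1], i.e. they concentrate towards 0.5, which is why they need not be conservative at face value.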
Manifold structure in graph embeddings
Statistical analysis of a graph often starts with embedding, the process of
representing its nodes as points in space. How to choose the embedding
dimension is a nuanced decision in practice, but in theory a notion of true
dimension is often available. In spectral embedding, this dimension may be very
high. However, this paper shows that existing random graph models, including
graphon and other latent position models, predict the data should live near a
much lower-dimensional set. One may therefore circumvent the curse of
dimensionality by employing methods which exploit hidden manifold structure.
Matrix factorisation and the interpretation of geodesic distance
Given a graph or similarity matrix, we consider the problem of recovering a
notion of true distance between the nodes, and so their true positions. We show
that this can be accomplished in two steps: matrix factorisation, followed by
nonlinear dimension reduction. This combination is effective because the point
cloud obtained in the first step lives close to a manifold in which latent
distance is encoded as geodesic distance. Hence, a nonlinear dimension
reduction tool, approximating geodesic distance, can recover the latent
positions, up to a simple transformation. We give a detailed account of the
case where spectral embedding is used, followed by Isomap, and provide
encouraging experimental evidence for other combinations of techniques.
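The two-step recipe, matrix factorisation followed by nonlinear dimension reduction, can be sketched as follows, assuming scikit-learn is available for the Isomap step. The latent distribution, the kernel, and all dimensions are hypothetical choices made for illustration only.

```python
import numpy as np
from sklearn.manifold import Isomap

def spectral_embed(A, d):
    """Step 1: matrix factorisation via a rank-d spectral decomposition."""
    vals, vecs = np.linalg.eigh(A)
    order = np.argsort(np.abs(vals))[::-1][:d]
    return vecs[:, order] * np.sqrt(np.abs(vals[order]))

# Toy latent position graph: 1-D latent positions, inner-product-style kernel.
rng = np.random.default_rng(0)
n = 400
z = rng.uniform(0, 1, n)                        # true 1-D latent positions
P = 0.25 * (1 + np.outer(z, z))                 # hypothetical kernel, values in [0.25, 0.5]
A = rng.random((n, n)) < P
A = np.triu(A, 1)
A = (A + A.T).astype(float)

X = spectral_embed(A, d=2)                      # point cloud lying near a 1-D manifold
# Step 2: Isomap approximates geodesic distance along that manifold,
# recovering the latent positions up to a simple transformation.
Y = Isomap(n_neighbors=10, n_components=1).fit_transform(X)
```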
Implications of sparsity and high triangle density for graph representation learning
Recent work has shown that sparse graphs containing many triangles cannot be
reproduced using a finite-dimensional representation of the nodes, in which
link probabilities are inner products. Here, we show that such graphs can be
reproduced using an infinite-dimensional inner product model, where the node
representations lie on a low-dimensional manifold. Recovering a global
representation of the manifold is impossible in a sparse regime. However, we
can zoom in on local neighbourhoods, where a lower-dimensional representation
is possible. As our constructions allow the points to be uniformly distributed
on the manifold, we find evidence against the common perception that triangles
imply community structure.
Hierarchical clustering with dot products recovers hidden tree structure
In this paper we offer a new perspective on the well-established
agglomerative clustering algorithm, focusing on recovery of hierarchical
structure. We recommend a simple variant of the standard algorithm, in which
clusters are merged by maximum average dot product and not, for example, by
minimum distance or within-cluster variance. We demonstrate that the tree
output by this algorithm provides a bona fide estimate of generative
hierarchical structure in data, under a generic probabilistic graphical model.
The key technical innovations are to understand how hierarchical information in
this model translates into tree geometry which can be recovered from data, and
to characterise the benefits of simultaneously growing sample size and data
dimension. We demonstrate superior tree recovery performance with real data
over existing approaches such as UPGMA, Ward's method, and HDBSCAN.
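The recommended variant, merging the pair of clusters with the highest average dot product, can be sketched directly. This is a naive, roughly cubic-time illustration of the merge rule, not the authors' implementation; the data and all names are hypothetical.

```python
import numpy as np

def dot_product_agglomeration(X):
    """Agglomerative clustering that repeatedly merges the pair of clusters
    with the highest average pairwise dot product, recording the merge tree."""
    n = X.shape[0]
    clusters = {i: [i] for i in range(n)}
    S = X @ X.T                                  # all pairwise dot products
    merges = []
    while len(clusters) > 1:
        keys = list(clusters)
        best, pair = -np.inf, None
        for a in range(len(keys)):
            for b in range(a + 1, len(keys)):
                i, j = keys[a], keys[b]
                sim = S[np.ix_(clusters[i], clusters[j])].mean()
                if sim > best:
                    best, pair = sim, (i, j)
        i, j = pair
        clusters[i] = clusters[i] + clusters.pop(j)
        merges.append((i, j, best))              # (kept id, absorbed id, similarity)
    return merges

# Two well-separated directions: within-group dot products are large,
# cross-group dot products are near zero, so groups merge internally first.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([5, 0], 0.5, (5, 2)),
               rng.normal([0, 5], 0.5, (5, 2))])
merges = dot_product_agglomeration(X)
```

Because similarity (not distance) drives the merges, early merges have high average dot products and the final merge, joining the two groups, has the lowest.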